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Abstract. We provide an extension of the perfect sampling algorithm 
of Fill (1998) to general chains, and describe how use of bounding pro- 
cesses can ease computational burden. Along the way, we unearth a 
simple connection between the Coupling From The Past (CFTP) algo- 
rithm originated by Propp and Wilson (1996) and our extension of Fill's 
algorithm. 



1 Introduction 

Markov chain Monte Carlo (MCMC) methods have become extremely popu- 



lar for Bayesian inference problems (consult, e.g., Gelfand and Smith [16], Smith 
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and Roberts [36|, Tierney |38{ , Gilks et al. [pTj), and for problems in other ar- 



eas, such as spatial statistics, statistical physics, and computer science (see, e.g., 
or Propp and Wilson |32] for pointers to the literature) as a way of sam- 

An 
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pling approximately from a complicated unknown probability distribution tt. 
MCMC algorithm constructs a Markov chain with one-step transition kernel K and 
stationary distribution tt; if the chain is run long enough, then under reasonably 
weak conditions (cf. Tierney p^) it will converge in distribution to tt, facilitating 
approximate sampling. 

One difficulty with these methods is that it is difficult to assess convergence to 
stationarity. This necessitates the use of difficult theoretical analysis (e.g., Meyn 
and Tweedie |l27|l, Rosenthal fSSf) or problematic convergence diagnostics (Cowles 



and Carlin |§, Brooks, et al. |2|) to draw reliable samples and do proper inference 
An interesting alternative algorithm, called cou pling from the past (CFTP), was 
introduced by Propp and Wilson (see also |33| and J 34 1) and has been studied 
and used b y a number of authors (including Kendall 
and Green |30|, Foss and Tweedie ll^, Kendall and Thonnes 
Tweedie 



Green and Murdoch 118 



M0ller 



25 



31 



I) 



28|, Murdoch 
Corcoran and 
By searching 



Murdoch and Rosenthal 
backwards in time until paths from all starting states have coalesced, this algorithm 
uses the Markov kernel K to sample exactly from tt. 

Another method of perfect simulation, for finite-state stochastically monotone 
chains, was proposed by Fill 
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Fill's algorithm is a form of rejection sampling. 
This algorithm was later extended by M0ller and Schladitz |29 and Thonnes |37| 
to non-finite chains, motivated by applications to spatial point processes. Fill's 
algorithm has the advantage over CFTP of removing the correlation between the 
length of the run and the returned value, which eliminates bias introduced by an 
impatient user or a system crash and so is "interruptible" . However, it has been 
used only for stochastically monotone chains, making heavy use of the ordering of 
state space elements. In his paper, Fill [11| indicated that his algorithm could be 
suitably modified to allow for the treatment of "anti-monotone" chains and (see 
his Section 11.2) indeed to generic chains. In this extended abstract we present a 
version of Fill's algorithm for generic chains; we, too, will provide an explanation 
in terms of rejection sampling. We have strived to keep to the spirit of the talks 
presented at the Workshop on Monte Carlo Methods held at the Fields Institute 
for Research in Mathematical Sciences in Toronto in October, 1998 and make our 
results accessible to a broad audience. Technical details are provided in the full 
paper 
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Following is our interruptible algorithm for generic chains. We discuss some of 
the terminology used and other details of the algorithm in Section |^. 

Algorithm 1.1 Choose and fix a positive integer i, choose an initial state Xt 
from any distribution absolutely continuous with respect to tt, and perform the fol- 
lowing routine. Run the time-reversed chain K for t steps, obtaining Xj, Xt_i, . . . , 
Xq in succession. Then, reversing the direction of time, generate (possibly de- 
pendent) Markov chains, one [say Y{x) = (Yo(a::),... ,Yt{x)), with Yo(a;) — x] 
evolving from each element x of the state space X and each with transition ker- 
nel K. All these trajectories are to be multivariately coupled ex post facto with the 
trajectory X = (Xq, . . . ,Xt), which is regarded as a trajectory from K; in partic- 
ular, Y(Xo) = X. Finally, we check whether all the values Yt[x), x ^ X, agree. If 
they do, we call this coalescence, and the value Xq is accepted as an observation 
from TT. If not, then the value Xq is rejected and so the routine fails. We then start 
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the routine again with an independent simulation, perhaps with a fresh choice of t 
and Xt, and repeat mitil the algorithm succeeds. 

Here is a simple intuitive explanation of why the algorithm works correctly. 
Imagine (1) starting the construction with Xq ~ tt and independently (2) building 
all of the paths Y(a;) [including X = Y(Xo)] simultaneously. Since determination 
of coalescence and the value of the coalesced paths at time t rely only on the second 
piece of randomness, conditionally given coalescence to Xt = z (for any z) we will 
still have Xq ^ tt, as desired. The algorithm builds the randomness in a different 
order, starting from X^. 



Remark 1.2 (a) Note that no assumption is made in Algorithm 1.1 concerning 
monotonicity or discreteness of the state space. 

(b) See Q for the definition of K. 

(c) To couple all of the trajectories Y(a;) ex post facto with the trajectory X 
means first to devise a multivariate coupling for all the trajectories by means of a 
transition rule and then to employ that coupling conditionally having observed the 
single trajectory Y(Xo) = X. Just how this is done is described in Section 3. 

(d) If coalescence occurs, then of course the common value of Yt{x), x ^ X, is 
the initial state Xt. 

(e) We have reversed the direction of time, and the roles of the kernels K and K, 



compared to Fill [11|. Furthermore, Fill's original algorithm also incorporated a 
search for a good value of t by doubling the previous value of t until the first success. 
For the most part, we shall not address such issues, instead leaving the choice of t 



entirely up to the user; but see Section 5.2 



(f) We have included the technical absolute continuity restriction on the dis- 
tribution of Xt to ensure correctness. In a typical application, one might know a 
density for tt with respect to some measure A (for example, A = Lebesgue measure) 
up to a normalizing constant. Then if, for example, that density is positive on the 
entire state space X, one need only take Xt to have any distribution having density 
with respect to A (for example, a normal distribution). 

Also, if the state space is discrete, or if it is continuous and all probabilistic 
ingredients to the algorithm (such as the kernel K) are sufficiently smooth, then 
the user may choose Xt deterministically and arbitrarily. 



As mentioned above, we will discuss the details of Algorithm Fl in Section g|. 
First, in Section ^, we motivate our algorithm in the context of a rather general 
rejection sampling framework. A more rigorous treatment may be found in the full 



paper [14 



In Section ^ we discuss how the computational burden of tracking all of the 
trajectories Y(x) can be eased by the use of coalescence detection events in general 
and bounding processes in particular; these processes take on a very simple form 
(see Section ^!^) when the state space is partially ordered and the transition rule 
employed is monotone. In the full paper we present a computationally efficient 



modification of Algorithm 1.1 that applies when K is assumed to be stochasti- 
cally monotone. (As discussed in Fill and Machida |13{ and Machida [ p6[ [ , this is a 
weaker assumption than that of the existence of a monotone transition rule.) When 
the state space is finite, the algorithm for the stochastically monotone case reduces 
to Fill's original algorithm. The full paper also discusses a "cross-monotone" gen- 
eralization of stochastic monotonicity. 
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In Section |^ we compare Algorithm 1.1 and CFTP. We also demonstrate a sim- 



ple connection between CFTP and an infinite-time- window version of Algorithm 



(namely, Algorithm 5.1 



1.1 



This extended abstract discusses only algorithms for generating a single ob- 
servation from a distribution tt of interest. In the full paper, we discuss various 
strategies for perfect generation of samples of arbitrary size from tt. 

The goal of this extended abstract is to describe how to apply Fill's perfect 
sampling algorithm to a broad spectrum of problems of real interest, not to develop 
any specific applications in detai l. B ut an underlying theme here is that while our 
extended algorithm (Algorithm LI) tends to be computationally more intricate 
than CFTP (see Section 5T), theoretically it is as broadly applicable as is CFTP. 
In this spirit, we will point to the applied perfect sampling literature at appropriate 
junctures, taking note of past applications of both CFTP and Fill's algorithm as 
examples. We hope that our extension of the latter algorithm will stimulate further 
research into this less-used alternative for perfect MCMC simulation. 

A valuable background resource is the annotated bibliography of perfect sam- 
pling maintained by Wilson |39]. 



2 Rejection sampling using auxiliary randomness 

Given a bivariate distribution C{X,X'), suppose that, for each x', we can 
simulate X from the conditional distribution C{X\X' = x'). How can we simulate X 
from its marginal distribution vr ;= C{X)1 This is the problem confronting us in 



the context of Algorithm 1.1, with {X,X') :— (Xo,Xf) and Xf chosen according 
to TT. Indeed, we assumed there that the user can simulate from £(Xo | Xf = x') = 
K*{x' , •), where JsT* is the t-step time-reversal transition kernel; and, when Xt ^ tt, 
we have £(Xo) = tt. Of course, if the user could simulate Xt ^ tt, then either Xt 
or Xo could be used directly as an observation from tt. But, for MCMC, this is an 
unreasonable assumption. 

So we turn to rejection sampling (e.g., Devroye |Q), done conditionally given 
X' — x' . Our goal is to use the observation x from C[X\X' = a;') to simulate one 
from TT. This can be done by accepting x as an observation from tt with probability 

■K(dx) 

""'p(Xedx|X' = x') 

for any oix' chosen to make < ^x' .x < 1; indeed, then 

, , / X P(X e dx, accept I X' = x') ax'iridx^ , 

P{X edx\X' = x' , accept) = ^ „, = ,\ /, = 7r{dx). 

^ ' ' ^ P(accept I X' = x') ax' J Tr{dy) ^ ' 

The question remains how to engineer a, coin- flip with probability '7a:',a: 

of heads, 

given that one can rarely compute ^x' ,x in practice. 
However, if we can find an event C so that 

P(C\X' = x' ,X^x)^^x-,x for all x', a;, (2.1) 



then we need only accept when (and only when) C occurs. Condition ( |2.lD requires 
precisely that 

F({X' e S'} n C n {X G S}) = 7r(B) / ax'P{X' edx') for all (2.2) 

JB' 
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Note that if we can choose C so that condition (2.2) holds, then setting B to be 
the entire state space X for X, we obtain 

P{{X' e B'}r\C) = [ a^>P{X'edx') for all B' 
Jb' 

and hence 

Pi{x' e B'}ncr\{x e B}) = tt{b)p{{x' e B'}nc) foraiis,s'. (2.3) 

Conversely, if we can choose C so that condition ( ^.3| ) holds, then we can choose 

a,. ■.= P{C\X' = x') 

to satisfy (O) and (U). 



How might we choose C to satisfy (2.3)? Observe that if C and another vari 



able Y are such that (i) Y = X' over the event C and (ii) X and the pair (Y, C) 
are independent, then 



LHS(|2j) = P{{x' B'}r\C r\{x B}) = P{{Y <^ B'}c^c r\{x e B}) 
= P{{Y eB'}r\C)Ti{B). 

Setting B — X and then substituting, we obtain (2.3). The Algorithm 1.1 is 
designed precisely so that we may take iY,C) :— (Yf(a;Q), {coalescence}) for an 
arbitrarily chosen (bu t fixed) Xq G ^ to satisfy (i) and (ii). This is discussed 



further in Section 3.1 below 



3 Details for Algorithm 1.1 
3.1 The ingredients. The goal of this subsection is to describe in some detail 



1.1 



how to apply Algorithm 

The space {X^B): It is sufficient that {X,B) be a complete separable metric 
space. In particular, this covers the case that X is discrete or Euclidean. See 



Section 3.1 of the full paper [14| for further discussion 



The kernel K and its time-reversal K: Let K he & Markov transition kernel 
on X . The kernel is chosen (by the user) so that tt is a stationary distribution, i.e., 
so that 



TT{dx)K(x, dy) = n{dy) on X. 

X 

The time-reversed kernel K (also on X) is then defined by 

'K{dx)K{x, dy) = Tr{dy)K{y, dx) on X X . (3.1) 

The transition rule (j): There exists a transition rule which can be used to drive 
the construction of the Markov chain of interest. More precisely, there exists a 
probability space {U, T, /i) and a (suitably measurable) function (p : X xU ^ X 
such that 

K{x,B) = ^l{u■. (j){x,u) B], xeX, BeB. (3.2) 

Such (j) (with accompanying fi) is sometimes called a transition rule. We choose 
and fix such a (0, /i). 

Remark 3.1 (a) A transition rule (p can always be found that uses {U, J-, fj,) = 
([0, 1], Borels, uniform distribution). In the special case that X is the unit interval, 
we can in fact use 

0(x, u) = G{x, u) 
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where G{x, •) is the usual inverse probability transform corresponding to the dis- 
tribution function u i~> K{x, [0,u\). 

(b) If X is discrete (finite or countably infinite) , a very simple alternative choice 
is the following "independent-transitions" transition rule. Let U — X"^ , let /i be 
product measure with xth marginal K{x, ■) (x S X), and let (fi be the evaluation 
function 

(t>{x, u) :— u{x). 

(c) Many interesting examples of transition rules can be found in the literature, 
including Diaconis and Freedman [^] and the references cited in Section 

(d) Usually there is a wealth of choices of transition rule, and the art is to find 
one giving rapid and easily detected coalescence. Without going into details at this 
point, we remark that the transition rule in (b) usually performs quite badly, while 
transition rules having a certain monotonicity property will perform well under 
monotonicity assumptions on K. 

The Markov chain and a first probability space: We will need to set up two 
probability spaces. The first space {n,A,P) — designated in ordinary typeface — 
will be useful for theoretical considerations and for the computation of certain 
conditional probability distributions. The second space (J^, A, P) — designated in 
boldface type — will be the probability space actually simulated when the algorithm 
is run. All random variables defined on the first space (respectively, second space) 
will also be designated in ordinary typeface (resp., boldface type). We have cho- 
sen this notational system to aid the reader: Corresponding variables, such as Xq 
and Xq, will play analogous roles in the two spaces. 

From our previous comments it is easy to see that there exists a space (U,J- ), 
a transition rule {4>,iJ,), and a probability space {n,A,P) on which are defined 
independent random variables Xq, Ui,U2, . ■ ■ ,Ut with Xq ^ tt and each Us ^ fJ.. 
Now inductively define 

X, :=0(X,_i,[/,), l<s<t. (3.3) 

Then X :— {Xq, ... ,Xt) is easily seen to be a stationary Markov chain with ker- 
nel K, in the sense that 

P{Xq G dxo, . . . , At G dxt) — TT{dxQ)K{xQ, dxi)- ■ ■K{xt~i, dxt) on X''^^^ . 

(3.4) 

In fact, for each x £ X we obtain a chain with kernel K started from x by defining 
Yq{x) := X and, inductively, 

Ys{x) ■.= <j>{Ys-i{x),Us). 
Let Y(x) := {Yq(x), ... ,Yt(x)). In this notation we have Y{Xq) = X. Let 

C :~ {Yt{x) does not depend on x} (3.5) 

denote the event that the trajectories Y{x) have all coalesced by time t. Fixing 
Xq G X arbitrarily and taking 

A = Ao, X'^Xt, Y = Ytix*Q), Casat(U) 

to match up with the notation in Section ^, key observations are that, for any 
(measurable) subset B oi X, Y — X' over the event C, and A and the event 
{Y G B}r)C are independent. The independence here follows from the fact that Aq 
and U := {Ui, . . . , Ut) have been chosen to be independent. 



Extension of Fill's perfect rejection sampling algorithm to general chains (EXT. ABS.) 7 



A second probability space and the algorithm: We make use of the auxiliary 
randomness provided by Xi, . . . , Xt_i and U and compute conditional probability 



distributions for the first probability space in stages. First observe from (3^) and 



repeated use of (3J^) that 



P{X(, e dxo, . . . , Xt^i e dxt-i I Xt = xt) = K{xt,dxt-i) ■ ■ ■ K{xi,dxo). 



Next, we will discuss in Section 3.2 how to compute C{U \ X — x). Finally, whether 
or not C occurs is determined solely by the randomness in U, so the conditional 
probability of C given {X,U) — {x, u) is degenerately either or 1. 

Moreover, our discussion has indicated how to set up and simulate the sec- 
ond space. As discussed in Section ^ we assume that the user knows enough 
about TT qualitatively to be able to simulate Xt so that £p(Xt) ^ tt. Having 
chosen Xt = xt, the user draws an observation Xt_i = Xt-i from K{xt,-), then 
an observation Xt_2 — xt-2 from K(xt-i, ■), etc. Next, having chosen X = ^ 
[i.e., (Xp,... ,Xt) — (.To,... ,xt)], the user draws an observation U = u from 
£,{U\X = x) and constructs Y(a;) — (Yo(x),... ,Yt{x)) by setting Yo(a;) := x 
and, inductively, 

Y,{x) :==0(Y,_i(t),U,). 

Finally, the user declares that C, or coalescence, has occurred if and only if the 
values of Yt(x), x ^ X, all agree. The conditional distribution of output from 



Algorithm 1.1 given that it ultimately succeeds (perhaps only after many iterations 



of the basic routine) is tt, as desired. 

Remark 3.2 (a) If P(C) > for suitably large i, then ultimate success is 
(a.s.) guaranteed if the successive choices of t become large. A necessary condition 
for ultimate positivity of P(C) is uniform ergodicity of K . This condition is also 
sufficient, in the (rather weak) sense that if K is uniformly ergodic, then there 
exists a finite i nteg er m and a transition rule 0„i for the m-step kernel _fC™ such 



that Algorithm LI, applied using i/)™, has P(C) > when t is chosen sufficiently 



large. Compare the analogous Theorem 4.2 for CFTP in Foss and Tweedie |15| . 



(b) Just as discussed in Fill [11] (see especially the end of Section 7 there), 
the algorithm (including its repetition of the basic routine) we have described is 
interruptible; that is, its running time (as measured by number of Markov chain 
steps) and output are independent random variables, conditionally given that the 
algorithm eventually terminates. 

(c) If the user chooses the value of Xj (= z, say) deterministically, then all that 
can be said in general is that the algorithm works properly for 7r-a.e. such choice. 
In this case, let the notation P2(C) refiect the dependence of P(C) on the initial 
state z. Then clearly 



j P4C)^(dz) = P(C), 



which is the unconditional probability of coalescence in our first probability space 
and therefore equal to the probability that CFTP terminates over an interval of 



width t. This provides a first link between CFTP and Algorithm 1.1. (Very) roughly 



recast, the distribution of running time for CFTP is the stationary mixture, over 



initial states, of the distributions of running time for Algorithm 1.1. For further 



elaboration of the connection between the two algorithms, see Section 5.3 
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3.2 Imputation. In order to be able to run Algorithm 1.1, the user needs to 
be able to impute U from X, i.e., to draw from C{U \ X — x). In this subsection 
we explain how to do this. We begin with the computation 

P{U edu\x = x) 

= P{U e du\XQ ^ xo, (l){xo,Ui) = xi, . . . , (j){xt-i,Ut) = xt) by 

— P{U ^ du\(j){xQ,Ui) — xi, . . . , 4'{xt~i,Ut) = Xt) by indep. of Xo and J7 

= P{Ui e dui I 4){xq, J/i) = xi) X • • • X P{Ut e dut I (t){xt-i,Ut) = xt) 

by independence of C/i, . . . ,Ut 
= P{Ui e dui I (j){xQ, t/i) = xi) X • • ■ X P{Ui e dut I (j){xt^i,Ui) = Xt) 

since C/i, . . . , C/j are identically distributed 
= P{Ui e dui \ Xq = xq, Xi ^ xi) y. ■ ■ ■ X P{Ui £ dut \ Xq = xt-i, Xi = xt), 

where the last equality is justified in the same fashion as for the first two. This 
establishes 

Lemma 3.3 The t-fold product of the measures 

P{Ui e dui I Xq = xq, Xi = xi), . . . , P{Ui G dut \ Xq = xt-i, Xi = xt) 

serves as a conditional probability distribution P{U £ du\X — x). 

In setting up the second probability space, therefore, the user, having chosen 
X = if, draws an observation U = m by drawing Ui , . . . , Ut independently, with 
Us chosen according to the distribution C(Ui \ Xq = Xs-i, Xi = x^). 

Remark 3.4 (a) If X is discrete, suppose we use the "independent-transitions" 
rule (j) discussed in Remark |3.l| (b). Then the measure n, but with the xpth marginal 
replaced by Sx^^, serves as C{Ui \ (j>{xo, Ui) — xi) — C{Ui \ Ui{xo) = xi) and there- 
fore as C(Ui I Xq — Xq, Xi — xi). Informally stated, having chosen Xj = Xs and 
X<;_i — Xs_i, the user imputes the forward-trajectory transitions from time s — 1 



to time s in Algorithm 1.1 by declaring that the transition from state Xs-i is to 
state Xs and that the transitions from other states are chosen independently ac- 
cording to their usual non-X-conditioned distributions. 

(b) As another example, suppose that X = [0, 1] and we use the inverse proba- 



bility transform transition rule discussed in Remark 3.1(a). Suppose also that each 
distribution function F{xo, ) = K{xo,[0,-]) is strictly increasing and onto [0,1]. 
Then Spi^xo,xi} serves as C{Ui \ Xq = xq, Xi = xi). Informally stated, a generated 
pair (Xs,Xs_i) = {xs,Xs-i) completely determines the value F(xs-i, x^) for U^. 



3.3 A toy example. We illustrate Algorithm 1.1 for a very simple example 
and two different choices of transition rule. Consider the discrete state space X — 
{0, 1, 2}, and let tt be uniform on X. Let K correspond to simple symmetric random 
walk with holding probability 1/2 at the endpoints; that is, putting k(x,y) :— 
Kix,{y}), 

k{0, 0) = fc(0, 1) = A:(l, 0) = fc(l, 2) k{2, 1) = fc(2, 2) = 1/2, 
k{0,2) = fc(l,l) = k{2,0) = 0. 

The stationary distribution is tt. As for any ergodic birth-and-death chain, K is 
reversible with respect to tt, i.e., K = K. Before starting the algorithm, choose a 
transition rule; this is discussed further below. 
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For utter simplicity of description, we choose t = 2 and (deterministically) 
Xt = (say); as discussed in Section |l|, a deterministic start is permissible here. 
We then choose Xi ~ K{0, •) and Xq | Xi ^ K(X.i, •). How we proceed from this 
juncture depends on what we chose for 0. 

One choice is the independent-transitions rule discussed in Remarks p!l](b) 
and ^!^(a). The algorithm's routine can then be run using 6 independent random 
bits: these decide Xi (given X2), Xq (given Xi), and the 4 transitions in the second 
(forward) phase of the routine not already determined from the rule 

Xs-i ^ Xs from time s — 1 to time s (s = 1, 2). 

There are thus a total of 2^ = 64 possible overall simulation results, each hav- 
ing probability 1/64. We check that exactly 12 of these produce coalescence. Of 
these 12 accepted results, exactly 4 have Xq = 0, another 4 have Xq — 1, and a fi- 
nal 4 have Xq = 2. Thus P(C) = 12/64 = 3/16, and we confirm that Cp^ O^o) = 7^, 
as should be true. An identical result holds if instead we choose Xt = 1 or Xf = 2. 



An alternative choice adapts Remarks 3.1(a) and 3.4(b) to our discrete setting. 
Note that we can use (iY, /i) = ({0, 1}, uniform) and 

{the mapping taking 0, 1, 2 to 0, 0, 1, respectively, if m = 
the mapping taking 0, 1, 2 to 1, 2, 2, respectively, if m = 1. 

Choosing t = 2 and Xf = as before, the algorithm can now be run with just 2 
random bits. In this case we check that exactly 3 of the 4 possible simulation 
results produce coalescence, 1 each yielding Xq = 0, 1, 2. Note that P(C) = 3/4 is 
much larger for this choice of (j). In fact, since </> is a monotone transition rule (see 



Definition 4.2 in Fill |11 or Definition 4.3 below), for the choice Xf = it gives the 



highest possible value of P(C) among all choices of 0: see Remark 9.3(e) in Fill [11 
It also is a best choice when Xf = 2. [On a minor negative note, we observe that 
P(C) — for the choice Xi — 1. Also note that the tt = (1/3, 1/3, l/3)-average 
of the acceptance probabilities (3/4,0,3/4), namely, 1/2, is the probability that 
forward coupling (or CFTP) done with the same transition rule gives coalescence 



within 2 time units; this corroborates Remark 3.2(c).] 



Remark 3.5 Both choices of (p are easily extended to handle simple symmetric 
random walk on {0, ... , n} for any n; the second (monotone) choice is again best 



possible. (We assume Xf — 0.) For fixed c G (0, 00) and large n, results in Fill |ll 
and Section 4 of Diaconis and Fill [Q] imply that, for t — cn^, the routine's success 
probability is approximately p(c); here p{c) increases smoothly from to 1 as c 
increases from to 00. We have not attempted the corresponding asymptotic 
analysis for the independent-transitions rule. Of course our chain is only a "toy" 
example anyway, because direct sampling from tt is elementary. 

4 Coalescence detection and bounding processes 

4.1 Conservative detection of coalescence and detection processes. 

Even for large finite state spaces <¥, determining exactly whether or not coalescence 
occurs in Algorithm can be prohibitively expensive computationally; indeed, in 
principle this requires tracking each of the trajectories Y(x), x G X to completion. 



However, observe that if we repeat the development of Sections 3.1-3.2, replacing 



the coalescence event {Yt{x) does not depend on x} of (3.5) by any subset C of this 
event whose occurrence (or not) can still be determined solely from U , then every- 
thing goes through as before. We call such an event C a coalescence detection event 
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and reiterate that C is a conservative indication of coalescence: the occurrence of a 
given coalescence detection event is sufficient, but not necessary, for the occurrence 
of coalescence of the paths Y{x). 

In practice, a coalescence detection event is constructed in terms of a detection 
process. What we mean by this is a stochastic process D = {Dq, ... ,Dt), defined 
on the same probability space {n,A,P) as U and X, together with a subset A of 
its state space T>, such that 

(a) -D is constructed from U, and 

(b) {Ds e A for some s < t} C {Yt{x) does not depend on x}. 
Then C := {Ds G A for some s < is a coalescence detection event. 

Remark 4.1 In practice, D usually evolves Markovianly using U; more pre- 
cisely, it is typically the case that there exists deterministic do ^ V and 6 : VxU ~> 



V such that Dq = do and [paralleling (3.3)] 



Ds:^S{Ds-i,Us), l<s<t. 

The important consequence is that, having determined the trajectory X and 
the imputed U, the user need only follow a single trajectory in the forward phase of 
the routine, namely, that of D (or rather its analogue D in the simulated probability 
space). 

Example 4.2 We sketch two illustrative examples of the use of detection pro- 



cesses that do not immediately fall into the more specific settings of Sections [4.2 
or 



■ 4^. We hasten to point out, however, that because of the highly special structure 
of these two examples, efficient implementation of Algorithm avoids the use of 
the forward phase altogether; this is discussed for example (a) in Fill 



12 



(^) Our first example is provided by the move-to-front (MTF) rule studied 



12 1 . Let K be the Markov kernel corresponding to MTF with independent and 



identically distributed record requests corresponding to probability weight vector 



{wi,... see (2.1) of [12| for specifics. The arguments of Section 4 of |12 

show that if Dg is taken to be the set of all records requested at least once among 
the first s requests and A is taken to consist of all {n — l)-element subsets of the 
records 1, . . . , n, then Z) is a detection process. Similar detection processes can be 
built for the following generalizations of MTF: move-to-root for binary search trees 



(see Dobrow and Fill |g| |10|) and MTF- like shuffles of hyperplane arrangement 
chambers and more general structures (see Bidigare, et al. [0 and Brown and 
Diaconis [||). 

(b) A second example of quite similar spirit is provided by the (now well- 
known) Markov chain (Xt) for generating a random spanning arborescence of the 
underlying weighted directed graph, with vertex set 14, of a Markov chain (Ut) with 
kernel q. Consult Propp and Wilson (who also discuss a more efficient "cycle- 
popping" algorithm) for details. We consider here only the special case that (Ut) is 
an i.i.d. sequence, i.e., that q{v,w) = q{w). A transition rule for the chain (Xt) 
is created as follows: for vertex u and arborescence x with root r, (p{x, u) is the 
arborescence obtained from x by adding an arc from r to u and deleting the unique 
arc in x whose tail is u. Then it can be shown that if Dg is taken to be the set 
of all vertices appearing at least once in {Ui, . . . , Us) and A := {U}, then D is a 
detection process. 
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4.2 Bounding processes. We obtain a natural example of a detection pro- 
cess D when (a) D is constructed from C/, (b) the corresponding state space V is 
some collection of subsets of with 

A:={{x'}: ^e'gA-}, 

and 

(c) Ds 2 {Ys{x) : xeX}. 

The concept is simple: in this case, each set Dg is just a "conservative estimate" 
(i.e., a superset) of the corresponding set {Ys{x) : x G X} of trajectory values; 
thus if Dg = {x'}, then the trajectories Y{x) are coalesced to state x' at time s and 
remain coalesced thereafter. We follow the natural impulse to call such a set- valued 
detection process a bounding process. Such bounding processes arise naturally in 
the contexts of monotone and anti-monotone transition rules (and have been used 
by many authors) : see the next subsection. Other examples of bounding processes 
can be found in works of Huber: see [^o| and in connection with CFTP and |22| 
in connection with our algorithm. 

Of course, nothing is gained, in comparison to tracking all the trajectories, by 
the use of a bounding process unless the states of T) have more concise representa- 
tions than those of generic subsets of X; after all, we could always choose V = 2'^ 
and Ds — {Ys{x) : x G X}. One rather general setting where compact represen- 
tations are often possible, discussed in the next subsection, is that of a partially 
ordered set (poset) X. 

4.3 Monotone transition rules. We now suppose that X is equipped with a 
partial order. We also assume here that there exist (necessarily unique) elements 
and 1 in X (called bottom element and top element^ respectively) such that < 
X < 1 for all X ^ X. We will discuss the case of monotone transition rules, where 
we can build from U a bivariate process {{L^, Vs)), taking values at each time s in 
X X X, such that 

Ls<Ys{x)<Vs for aU < s < i and aU X G A:". (4.1) 

Then Dg := [Lg, Vg] = {x G X : Lg < x < Vg} gives a bounding process, and the 
pair (Lg, Vg) is a quite concise representation of Dg. 

Recall that our construction (|3.3|) of the Markov chain X with kernel K relies 



on the choice of a transition rule (</>, /J.) satisfying (3.2). 

Definition 4.3 A transition rule (j) is said to be monotone if each of the map- 
pings 

4i{-,u) : X ^ X, u £U, is monotone increasing, i.e., if 

(/){x, u) < (/)(y, u) for all u € U 

whenever x < y. 

Suppose now that is monotone. Set {Lq,Vo) := (0,1) and, inductively, 

{Lg,Vg) := {^{Lg^i,Ug),4>{Vg-uUg)), l<s<t. 

One immediately verifies by induction that ( [l.l| ) is satisfied. Note that [Lg,Vg) 
is determined solely by U (as is the coalescence detection event C — {Lt — Vt}) 
and is nothing more than (Kj(0), Ys(l)). In plain language, since monotonicity is 
preserved, when the chains Y{Q) and Y{1) have coalesced, so must have every Y{x). 
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Remark 4.4 (a) Lower and upper bounding processes can also be constructed 
when Algorithm 1.1 is applied with a so-called "anti- monotone" transition rule; we 
omit the details. See Haggstrom and Nelander Huber 1211, Kendall 

M0ller 



23 



28 



See Haggstrom and Nelander [19|, Huber 
M0ller and Schladitz [29|, and Thonnes 1 371 for further discussion 



in various specialized settings. There are at least two neat tricks associated with 
anti-monotone rules. The first is that, by altering the natural partial order on X, 
such rules can be regarded, in certain bipartite-type settings, as monotone rules. 



in which case the performance analysis in Section 5.3 of |14] is available: consult 



Section 3 of |19], the paper |29|, and Definition 5.1 in |37 . The second is that the 



poset X is allowed to be "upwardly unbounded" and so need not have a 1: consult 
Section 2 of 



28 1 and, again, p9l and [37 



(b) Dealing with monotone rules on partially ordered state spaces without 1 is 
problematic and requires the use of "dominating processes." We comment that a 
dominating process provides a sort of random bounding process and is useful when 
the state spa ce is noncompact, but we shall not pursue these ideas any further here. 
See Kendall ^ and Kendall and M0ller [|^ in the context of CFTP; we hope to 
discuss the use of dominating processes for our algorithm in future work. 

5 Our algorithm and CTFP: comparison and connection 



C om parison. How does our extension of Fill's algorithm, as given by 
and discussed in detail in Section ||, compare to CFTP? As we see 



1.1 



11 



a pri- 



5.1 

Algorithm 

it, our algorithm has two main advantages and one main disadvantage. 

Advantages: As discussed in Section ^ and Remark |3.2| (b) and in 
mary advantage of our algorithm is intcrruptibility. Given the close connection 
between the algorithms described in Section |5.3| , one may reasonably view the ex- 
tra computational costs of our algorithm (see "Disadvantage^^ below) as the costs 
of securing intcrruptibility. A related second advantage concerns memory alloca- 
tion. Suppose , for example, that our state space X is finite and that e ach t ime-step 
of Algorithm 



1.1 



including the necessary imputation (recall Section 3.2), can be 
carried out using a bounded amount of memory. Then, for fixed t, our algorithm 
can be carried out using a fixed finite amount of memory. Unfortunately, it is rare 
in practice that the kernel K employed is sufficiently well analyzed that one knows 
in advance a value of t (and a value of the seed Xt) giving a reasonably large prob- 
ability P(C) of acceptance. Furthermore, the fixed amount of memory needed is 
in practice larger than the typical amount of memory allocated dynamically in a 
run of CFTP. Finally, wc should note that Wilson |40| has very recently presented 
a version of CFTP which also can be carried out with a fixed finite amount of 
memory, and which docs not require an a priori estimate of the mixing time of the 
chain. 

Disadvantage: A major disadvantage of our Algorithm |l.l| concerns computa- 



tional complexity. We refer the reader to |11| and |12 for a more detailed discus- 
sion in the setting of our Section ^ (and, more generally, the setting of stochastic 
monotonicity) . Briefly, if no attention is paid to memory usage, our algorithm has 
running time competitive with CFTP: cf. Remark 3.2(c), and also the discussion 
in Remark 9.3(e) of |11 that the running time of our algorithm is, in a certain 
sense, best possible in the stochastically monotone setting. However, this analy- 
sis assumes that running time is measured in Markov chain steps; unfortunately, 
time-reversed steps can sometimes take longer than do forward steps to execute 
(e.g., |12|), and the imputation described in Section p]3 is sometimes difficult to 
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carry out. Moreover, the memory usage for naive implementation of our algorithm 
can be exorbitant; how to trade off speed for reduction in storage needs is described 
in 



11 



5.2 An alternative to Algorithm 1.1 . T hus far we have been somewhat 
sketchy about the choice (s) of t in Algorithm LI. As discussed in Section |l|, one 



possibility is to run the repetitions of the basic routine independently, doubling t 
at each stage. However, another possibility is to continue back in time, reusing 
the already imputed values and checking again for coalescence. (There is an 



oblique reference to this alternative in Remark 9.3 of Fill [11|.) This idea leads to 
the following algorithm. 

Algorithm 5.1 Choose an initial state Xq ~ tt, where tt is absolutely contin- 
uous with respect to tt. Run the time-reversed chain K, obtaining Xq, X_i, ... in 
succession. Conditionally given (Xq, X„i, . . . ) = (a;o, X-i, ■ ■ ■), generate indepen- 
dent random variables Uq, U_i, . . . with marginals 

P(Us e du) ^ P{U e du\(j){xs-i,U) ^ Xs), s==0,-l,..., (5.1) 
where, on the right, Cp{U) — is given by (^). For t ~ 0,1,. . . and x e X, set 
Y^_/\x) :— x and, inductively, 

Y(-*)(x) 0(Yi:l^(x),U,), -t+l<s<0. 

If T < oo is the smallest t such that 

Y^-'\x),x€X, agree (= Xq), (5.2) 

then the algorithm succeeds and reports W :~ X_t as an observation from tt. 
Otherwise, the algorithm fails. 

Remark 5.2 (a) We need only generate Xo,X_i, . . . ,X_f and then impute 
Uo, U_i, . . . , U_t+i using ( |5.lD in order to check whether or not ( ^ ) holds. Thus 
if T < oo, then the algorithm terminates in finite time. 

(b) We omit the detailed description a la Section |[ But the key in setting up 
the first probability space is first to choose W ^ n and Uq, U-i, ... all mutually 
independent and then, having determined the backwards coalescence time T from 
Uo, U-i, . . . , to set X_T := W. 

(c) We may relax the condition that T be the smallest t satisfying ( |5.2| ), via 
the use of coalescence detection events as in Section ^j. In particular, to save 
considerably on computational effort, we may let T' be the smallest t which is a 
power of 2 such th at (p^ ) holds and report X_t' instead. 



(d) Algorithm 5.1, and likewise its variant in remark (c), is interruptible: T 
and W are conditionally independent given success. 

(e) Uniform ergodicity of K is necessary (and, in a weak sense, sufficient) for 



almost sure success of Algorithm 5.1; cf. Remark |3.2|(a) 



5.3 Connection with CFTP. There is a strong and simple connection be- 
tween CFTP and our Algorithm Indeed, suppose we carry out the usual 
CFTP algorithm to sample from w, using kernel K, transition rule (p, and driv- 
ing variables U — {Uo,U-i, . . .). Let T denote the backwards coalescence time 
and let Aq ^ tt denote the terminal state output by CFTP. Let W ^ tt indepen- 
dent of U, and follow the trajectory from X^t := W to Xq; call this trajectory 
X — {X^T, ■ ■ ■ , Xq). Since Xq is determined solely by U, the random variables W 
and Xq are independent. 
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When TT = TT in Algorithm 5.1, the algorithm simply constructs the same prob- 
ability space as for CFTP, but with the ingredients generated in a different chrono- 
logical order: first Xq, X^i, . . . ; then U (which determines T); then W := X-t- 
Again Xq ~ tt and ~ tt are independent. 

Remark 5.3 (a) Because of this statistical independence, it does not matter 



in Algorithm 5.1 that we actually use Xq ^ tt ^ tt. 

(b) The fact (1) that W, unlike Xq, is independent of U, together with (2) that 
T depends solely on U, explains why our algorithm is interruptible and CFTP is 
not. 

(c) In a single run of CFTP, the user would of course be unable to choose 



W ^ n as above, just as in a single run of Algorithm 5.1 we do not actually choose 



Xq ~ TT. So one might regard our described connection between the two algorithms 



as a bit metaphorical. But see Section 7.2 of [14 
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